Question: Are soccer referees more likely to give red cards to dark-skin-toned players than light-skin-toned players?

-> GridSpec: Specifies the geometry of the grid that a subplot can be placed in.
-> tqdm: To show the progress
-> missingno: provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset
In [1]:
%matplotlib inline

import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
# sns.set_context("poster", font_scale=1.3)

import missingno as msno
import pandas_profiling

from sklearn.datasets import make_blobs
import time
from pandas.plotting import scatter_matrix

DATA:

Variable Name: Variable Description:
playerShort short player ID
player player name
club player club
leagueCountry country of player club (England, Germany, France, and Spain)
height player height (in cm)
weight player weight (in kg)
position player position
games number of games in the player-referee dyad
goals number of goals in the player-referee dyad
yellowCards number of yellow cards player received from the referee
yellowReds number of yellow-red cards player received from the referee
redCards number of red cards player received from the referee
photoID ID of player photo (if available)
rater1 skin rating of photo by rater 1
rater2 skin rating of photo by rater 2
refNum unique referee ID number (referee name removed for anonymizing purposes)
refCountry unique referee country ID number
meanIAT mean implicit bias score (using the race IAT) for referee country
nIAT sample size for race IAT in that particular country
seIAT standard error for mean estimate of race IAT
meanExp mean explicit bias score (using a racial thermometer task) for referee country
nExp sample size for explicit bias in that particular country
seExp standard error for mean estimate of explicit bias measure
In [2]:
#Read the data
data_df = pd.read_csv('../data/redcard.csv')
data_df.head()
Out[2]:
playerShort player club leagueCountry birthday height weight position games victories ... rater2 refNum refCountry Alpha_3 meanIAT nIAT seIAT meanExp nExp seExp
0 lucas-wilchez Lucas Wilchez Real Zaragoza Spain 31.08.1983 177.0 72.0 Attacking Midfielder 1 0 ... 0.50 1 1 GRC 0.326391 712.0 0.000564 0.396000 750.0 0.002696
1 john-utaka John Utaka Montpellier HSC France 08.01.1982 179.0 82.0 Right Winger 1 0 ... 0.75 2 2 ZMB 0.203375 40.0 0.010875 -0.204082 49.0 0.061504
2 abdon-prats Abdón Prats RCD Mallorca Spain 17.12.1992 181.0 79.0 NaN 1 0 ... NaN 3 3 ESP 0.369894 1785.0 0.000229 0.588297 1897.0 0.001002
3 pablo-mari Pablo Marí RCD Mallorca Spain 31.08.1993 191.0 87.0 Center Back 1 1 ... NaN 3 3 ESP 0.369894 1785.0 0.000229 0.588297 1897.0 0.001002
4 ruben-pena Rubén Peña Real Valladolid Spain 18.07.1991 172.0 70.0 Right Midfielder 1 1 ... NaN 3 3 ESP 0.369894 1785.0 0.000229 0.588297 1897.0 0.001002

5 rows × 28 columns

In [3]:
data_df.shape
Out[3]:
(146028, 28)
In [4]:
data_df.describe().T
Out[4]:
count mean std min 25% 50% 75% max
height 145765.0 181.935938 6.738726 1.610000e+02 177.000000 182.000000 187.000000 2.030000e+02
weight 143785.0 76.075662 7.140906 5.400000e+01 71.000000 76.000000 81.000000 1.000000e+02
games 146028.0 2.921166 3.413633 1.000000e+00 1.000000 2.000000 3.000000 4.700000e+01
victories 146028.0 1.278344 1.790725 0.000000e+00 0.000000 1.000000 2.000000 2.900000e+01
ties 146028.0 0.708241 1.116793 0.000000e+00 0.000000 0.000000 1.000000 1.400000e+01
defeats 146028.0 0.934581 1.383059 0.000000e+00 0.000000 1.000000 1.000000 1.800000e+01
goals 146028.0 0.338058 0.906481 0.000000e+00 0.000000 0.000000 0.000000 2.300000e+01
yellowCards 146028.0 0.385364 0.795333 0.000000e+00 0.000000 0.000000 1.000000 1.400000e+01
yellowReds 146028.0 0.011381 0.107931 0.000000e+00 0.000000 0.000000 0.000000 3.000000e+00
redCards 146028.0 0.012559 0.112889 0.000000e+00 0.000000 0.000000 0.000000 2.000000e+00
rater1 124621.0 0.264255 0.295382 0.000000e+00 0.000000 0.250000 0.250000 1.000000e+00
rater2 124621.0 0.302862 0.293020 0.000000e+00 0.000000 0.250000 0.500000 1.000000e+00
refNum 146028.0 1534.827444 918.736625 1.000000e+00 641.000000 1604.000000 2345.000000 3.147000e+03
refCountry 146028.0 29.642842 27.496189 1.000000e+00 7.000000 21.000000 44.000000 1.610000e+02
meanIAT 145865.0 0.346276 0.032246 -4.725423e-02 0.334684 0.336628 0.369894 5.737933e-01
nIAT 145865.0 19697.411216 127126.197143 2.000000e+00 1785.000000 2882.000000 7749.000000 1.975803e+06
seIAT 145865.0 0.000631 0.004736 2.235373e-07 0.000055 0.000151 0.000229 2.862871e-01
meanExp 145865.0 0.452026 0.217469 -1.375000e+00 0.336101 0.356446 0.588297 1.800000e+00
nExp 145865.0 20440.233860 130615.745103 2.000000e+00 1897.000000 3011.000000 7974.000000 2.029548e+06
seExp 145865.0 0.002994 0.019723 1.043334e-06 0.000225 0.000586 0.001002 1.060660e+00
In [5]:
data_df.dtypes
Out[5]:
playerShort       object
player            object
club              object
leagueCountry     object
birthday          object
height           float64
weight           float64
position          object
games              int64
victories          int64
ties               int64
defeats            int64
goals              int64
yellowCards        int64
yellowReds         int64
redCards           int64
photoID           object
rater1           float64
rater2           float64
refNum             int64
refCountry         int64
Alpha_3           object
meanIAT          float64
nIAT             float64
seIAT            float64
meanExp          float64
nExp             float64
seExp            float64
dtype: object
In [6]:
columns = data_df.columns.tolist()
columns
Out[6]:
['playerShort',
 'player',
 'club',
 'leagueCountry',
 'birthday',
 'height',
 'weight',
 'position',
 'games',
 'victories',
 'ties',
 'defeats',
 'goals',
 'yellowCards',
 'yellowReds',
 'redCards',
 'photoID',
 'rater1',
 'rater2',
 'refNum',
 'refCountry',
 'Alpha_3',
 'meanIAT',
 'nIAT',
 'seIAT',
 'meanExp',
 'nExp',
 'seExp']

Questions on the dataset

How do we operationalize the question of referees giving more red cards to dark skinned players?

  • Counterfactual: if the player were lighter, a ref is more likely to have given a yellow or no card for the same offense under the same conditions
  • Regression: accounting for confounding, darker players have positive coefficient on regression against proportion red/total card

Potential issues

  • How to combine rater1 and rater2? Average them? What if they disagree? Throw it out?
  • Is data imbalanced, i.e. red cards are very rare?
  • Is data biased, i.e. players have different amounts of play time? Is this a summary of their whole career?
  • How do I know I've accounted for all forms of confounding?

First, is there systematic discrimination across all refs?

Exploration/hypotheses:

  • Distribution of games played
  • red cards vs games played
  • Reds per game played vs total cards per game played by skin color
  • Distribution of # red, # yellow, total cards, and fraction red per game played for all players by avg skin color
  • How many refs did players encounter?
  • Do some clubs play more aggresively and get carded more? Or are more reserved and get less?
  • Does carding vary by leagueCountry?
  • Do high scorers get more slack (fewer cards) for the same position?
  • Are there some referees that give more red/yellow cards than others?
  • how consistent are raters? Check with Cohen's kappa.
  • how do red cards vary by position? e.g. defenders get more?
  • Do players with more games get more cards, and is there difference across skin color?
  • indication of bias depending on refCountry?

Data Cleaning

Create multiple tables for

  • players
  • refrees
  • clubs
  • countries
  • dyands (player-refree combo)

Players

In [7]:
players_index = 'playerShort'
players_cols = [
               'player',
               'birthday',
               'height',
               'weight',
               'position',
               'photoID',
               'rater1',
               'rater2',
]
In [8]:
#Check if we have duplicate entries
#If the same player has different values in differnt rows, then the cell value will be > 1
all_cols_unique_players = data_df.groupby('playerShort').agg({col:'nunique' for col in players_cols})
all_cols_unique_players.head()
Out[8]:
player birthday height weight position photoID rater1 rater2
playerShort
aaron-hughes 1 1 1 1 1 1 1 1
aaron-hunt 1 1 1 1 1 1 1 1
aaron-lennon 1 1 1 1 1 1 1 1
aaron-ramsey 1 1 1 1 1 1 1 1
abdelhamid-el-kaoutari 1 1 1 1 1 1 1 1
In [9]:
all_cols_unique_players[all_cols_unique_players > 1].dropna().shape[0] == 0
Out[9]:
True

This being True says that all the values are same per player across multiple rows

In [10]:
def get_subgroup(dataframe, g_index, g_columns):
    '''
    purpose: Group the dataframe based on index and check for the group's uniqueness
    parameters:
        dataframe: DF on which grouping and uniqueness test needs to be done
        g_index: Index for grouping
        g_columns: Columns for grouping
    return:
        Dataframe with unique entries and max value for each column
    '''
    g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
    if g[g > 1].dropna().shape[0] != 0:
        print("This group doesn't have unique values")
    return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})
In [11]:
players = get_subgroup(data_df,players_index, players_cols)
players.head()
Out[11]:
player birthday height weight position photoID rater1 rater2
playerShort
aaron-hughes Aaron Hughes 08.11.1979 182.0 71.0 Center Back 3868.jpg 0.25 0.00
aaron-hunt Aaron Hunt 04.09.1986 183.0 73.0 Attacking Midfielder 20136.jpg 0.00 0.25
aaron-lennon Aaron Lennon 16.04.1987 165.0 63.0 Right Midfielder 13515.jpg 0.25 0.25
aaron-ramsey Aaron Ramsey 26.12.1990 178.0 76.0 Center Midfielder 94953.jpg 0.00 0.00
abdelhamid-el-kaoutari Abdelhamid El-Kaoutari 17.03.1990 180.0 73.0 Center Back 124913.jpg 0.25 0.25

Club

In [12]:
club_index = 'club'
club_cols = ['leagueCountry']
clubs = get_subgroup(data_df,club_index,club_cols)
clubs.head()
Out[12]:
leagueCountry
club
1. FC Nürnberg Germany
1. FSV Mainz 05 Germany
1899 Hoffenheim Germany
AC Ajaccio France
AFC Bournemouth England
In [13]:
clubs.leagueCountry.value_counts()
Out[13]:
England    48
Spain      27
France     22
Germany    21
Name: leagueCountry, dtype: int64

Referee

In [14]:
referee_index = 'refNum'
referee_cols = ['refCountry']
refrees = get_subgroup(data_df,referee_index,referee_cols)
refrees.head()
Out[14]:
refCountry
refNum
1 1
2 2
3 3
4 4
5 5
In [15]:
refrees.refCountry.nunique()
Out[15]:
161

Countries

In [16]:
countries_index = 'refCountry'
countries_cols = [
                'Alpha_3', # rename this name of country
                'meanIAT',
                'nIAT',
                'seIAT',
                'meanExp',
                'nExp',
                'seExp',
]
countries = get_subgroup(data_df,countries_index,countries_cols)
countries.head()
Out[16]:
Alpha_3 meanIAT nIAT seIAT meanExp nExp seExp
refCountry
1 GRC 0.326391 712.0 0.000564 0.396000 750.0 0.002696
2 ZMB 0.203375 40.0 0.010875 -0.204082 49.0 0.061504
3 ESP 0.369894 1785.0 0.000229 0.588297 1897.0 0.001002
4 LUX 0.325185 127.0 0.003297 0.538462 130.0 0.013752
5 TUN 0.167132 19.0 0.027327 -0.789474 19.0 0.111757
In [17]:
rename_cols = {'Alpha_3':'countryName'}
countries = countries.rename(columns=rename_cols)
countries.head()
Out[17]:
countryName meanIAT nIAT seIAT meanExp nExp seExp
refCountry
1 GRC 0.326391 712.0 0.000564 0.396000 750.0 0.002696
2 ZMB 0.203375 40.0 0.010875 -0.204082 49.0 0.061504
3 ESP 0.369894 1785.0 0.000229 0.588297 1897.0 0.001002
4 LUX 0.325185 127.0 0.003297 0.538462 130.0 0.013752
5 TUN 0.167132 19.0 0.027327 -0.789474 19.0 0.111757

Dyads

In [18]:
dyands_index = ['refNum', 'playerShort']
dyands_cols = [
            'games',
             'victories',
             'ties',
             'defeats',
             'goals',
             'yellowCards',
             'yellowReds',
             'redCards',
]
dyands = get_subgroup(data_df,dyands_index,dyands_cols)
dyands.head(10)
Out[18]:
games victories ties defeats goals yellowCards yellowReds redCards
refNum playerShort
1 lucas-wilchez 1 0 0 1 0 0 0 0
2 john-utaka 1 0 0 1 0 1 0 0
3 abdon-prats 1 0 1 0 0 1 0 0
pablo-mari 1 1 0 0 0 0 0 0
ruben-pena 1 1 0 0 0 0 0 0
4 aaron-hughes 1 0 0 1 0 0 0 0
aleksandar-kolarov 1 1 0 0 0 0 0 0
alexander-tettey 1 0 0 1 0 0 0 0
anders-lindegaard 1 0 1 0 0 0 0 0
andreas-beck 1 1 0 0 0 0 0 0
In [19]:
#get those rows where #redcards > 1
dyands[dyands['redCards']>1]
Out[19]:
games victories ties defeats goals yellowCards yellowReds redCards
refNum playerShort
140 bodipo 6 2 1 3 1 0 0 2
367 antonio-lopez_2 8 5 2 1 0 2 0 2
432 javi-martinez 14 4 3 7 2 2 0 2
jonas 9 1 4 4 1 0 0 2
487 phil-jagielka 7 2 1 4 1 0 0 2
586 cyril-jeunechamp 14 8 0 6 0 6 0 2
804 sergio-ramos 18 12 1 5 4 6 1 2
985 aly-cissokho 9 1 5 3 1 1 0 2
1114 eugen-polanski 8 4 0 4 0 0 0 2
1214 emmanuel-adebayor 23 9 7 7 10 4 1 2
1349 jose-nunes 6 2 2 2 0 0 0 2
1574 jesus-rueda 6 0 2 4 0 0 0 2
1696 marc-planus 19 10 6 3 0 1 0 2
1886 de-lucas 10 3 3 4 0 1 0 2
1931 fallou-diagne 3 0 0 3 0 0 0 2
2080 laurent-koscielny 14 5 2 7 1 4 0 2
2246 saber-khelifa 5 1 1 3 0 0 0 2
2370 ivo-ilicevic 9 3 4 2 0 1 0 2
2796 fabien-audard 8 2 3 3 0 0 0 2
2803 cristiano-ronaldo 22 15 3 4 9 4 0 2
2822 gary-cahill 5 2 1 2 0 0 0 2
lee-cattermole 4 1 2 1 0 1 0 2
2961 cyril-jeunechamp 11 5 3 3 0 3 0 2
mouhamadou-dabo 4 3 0 1 0 0 0 2
3099 rafael-van-der-vaart 14 6 2 6 3 4 0 2
In [20]:
dyands.redCards.max()
Out[20]:
2

Analysis

Players

In [21]:
players.shape
Out[21]:
(2053, 8)
In [22]:
players.columns
Out[22]:
Index(['player', 'birthday', 'height', 'weight', 'position', 'photoID',
       'rater1', 'rater2'],
      dtype='object')
In [23]:
#players name adds no value
players.drop('player', axis=1, inplace=True)
players.head()
Out[23]:
birthday height weight position photoID rater1 rater2
playerShort
aaron-hughes 08.11.1979 182.0 71.0 Center Back 3868.jpg 0.25 0.00
aaron-hunt 04.09.1986 183.0 73.0 Attacking Midfielder 20136.jpg 0.00 0.25
aaron-lennon 16.04.1987 165.0 63.0 Right Midfielder 13515.jpg 0.25 0.25
aaron-ramsey 26.12.1990 178.0 76.0 Center Midfielder 94953.jpg 0.00 0.00
abdelhamid-el-kaoutari 17.03.1990 180.0 73.0 Center Back 124913.jpg 0.25 0.25
In [24]:
#Visualize the missing data on a sample of 500 records
msno.matrix(players.sample(500),
            figsize=(16,7),
            width_ratios=(15,1)
           )
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce00124d48>
In [25]:
msno.bar(
    players.sample(500),
    figsize=(16,7)
)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce00258608>

Insight: Most of the players have no rating from either of the raters.

Action: Does the missing data have any correlations? -> Genrate a heatmap

In [26]:
msno.heatmap(players.sample(500),figsize=(16,7))
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce004cbdc8>

Insight: PhotoID is correlated with the rater1 and rater2

In [27]:
#Check for number of missing values
print("# players: ", len(players))
print("# rater1 missing: ", len(players[(players.rater1.isnull())]))
print("# rater2 missing: ", len(players[(players.rater2.isnull())]))
print("# rater1 & rater2 missing: ", len(players[(players.rater1.isnull()) & (players.rater2.isnull())]))
# players:  2053
# rater1 missing:  468
# rater2 missing:  468
# rater1 & rater2 missing:  468
In [28]:
#Get the dataframe with no nulls
players = players[players.rater1.notnull()]
players.head()
Out[28]:
birthday height weight position photoID rater1 rater2
playerShort
aaron-hughes 08.11.1979 182.0 71.0 Center Back 3868.jpg 0.25 0.00
aaron-hunt 04.09.1986 183.0 73.0 Attacking Midfielder 20136.jpg 0.00 0.25
aaron-lennon 16.04.1987 165.0 63.0 Right Midfielder 13515.jpg 0.25 0.25
aaron-ramsey 26.12.1990 178.0 76.0 Center Midfielder 94953.jpg 0.00 0.00
abdelhamid-el-kaoutari 17.03.1990 180.0 73.0 Center Back 124913.jpg 0.25 0.25
In [29]:
len(players)
Out[29]:
1585
In [30]:
msno.matrix(players.sample(500),figsize=(16,7))
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce00808448>

Now that each player has both ratings, how are combining them? But, before combinig, are they close enough to combine? How to check close enough? -> Correlations

In [31]:
fig, ax = plt.subplots(figsize=(12,10))
sns.heatmap(pd.crosstab(players.rater1, players.rater2), cmap='Blues', annot=True, fmt = 'd', ax=ax)
ax.set_title('Correlation between Rater1 & Rater2')
fig.tight_layout()
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Out[31]:
(5.0, 0.0)

From this, we can see that, both ther raters rated approximately the same. Therefore combine the ratings by averaging them.

In [32]:
players.head()
Out[32]:
birthday height weight position photoID rater1 rater2
playerShort
aaron-hughes 08.11.1979 182.0 71.0 Center Back 3868.jpg 0.25 0.00
aaron-hunt 04.09.1986 183.0 73.0 Attacking Midfielder 20136.jpg 0.00 0.25
aaron-lennon 16.04.1987 165.0 63.0 Right Midfielder 13515.jpg 0.25 0.25
aaron-ramsey 26.12.1990 178.0 76.0 Center Midfielder 94953.jpg 0.00 0.00
abdelhamid-el-kaoutari 17.03.1990 180.0 73.0 Center Back 124913.jpg 0.25 0.25
In [33]:
players['skinTone'] = players[['rater1','rater2']].mean(axis=1)
players.head()
Out[33]:
birthday height weight position photoID rater1 rater2 skinTone
playerShort
aaron-hughes 08.11.1979 182.0 71.0 Center Back 3868.jpg 0.25 0.00 0.125
aaron-hunt 04.09.1986 183.0 73.0 Attacking Midfielder 20136.jpg 0.00 0.25 0.125
aaron-lennon 16.04.1987 165.0 63.0 Right Midfielder 13515.jpg 0.25 0.25 0.250
aaron-ramsey 26.12.1990 178.0 76.0 Center Midfielder 94953.jpg 0.00 0.00 0.000
abdelhamid-el-kaoutari 17.03.1990 180.0 73.0 Center Back 124913.jpg 0.25 0.25 0.250

Analyze the data distributions

skinTone

In [34]:
sns.distplot(players.skinTone,kde=False)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce02e6fcc8>

Position

the position of the player might affect the redCard. like, chances of a defender getting a red might be more than a keeper

In [35]:
MIDSIZE = (12,8)
fig,ax = plt.subplots(figsize=MIDSIZE)
players.position.value_counts(dropna=False, ascending=True).plot(kind='barh',ax=ax)
ax.set_ylabel("Position")
ax.set_xlabel("Counts")
# fig.tight_layout()
Out[35]:
Text(0.5, 0, 'Counts')

Create high level categories for positions

In [36]:
positions = players.position.unique()
positions
Out[36]:
array(['Center Back', 'Attacking Midfielder', 'Right Midfielder',
       'Center Midfielder', 'Goalkeeper', 'Defensive Midfielder',
       'Left Fullback', nan, 'Left Midfielder', 'Right Fullback',
       'Center Forward', 'Left Winger', 'Right Winger'], dtype=object)
In [37]:
defense = ['Center Back','Defensive Midfielder', 'Left Fullback', 'Right Fullback']
midfield = ['Right Midfielder', 'Center Midfielder', 'Left Midfielder']
forward = ['Attacking Midfielder', 'Left Winger', 'Right Winger', 'Center Forward']
keeper = ['Goalkeeper']

#Modify the DF
players.loc[players['position'].isin(defense), 'position_agg'] = 'Defence'
players.loc[players['position'].isin(midfield), 'position_agg'] = 'MidField'
players.loc[players['position'].isin(forward), 'position_agg'] = 'Forward'
players.loc[players['position'].isin(keeper), 'position_agg'] = 'Keeper'
players.head()
Out[37]:
birthday height weight position photoID rater1 rater2 skinTone position_agg
playerShort
aaron-hughes 08.11.1979 182.0 71.0 Center Back 3868.jpg 0.25 0.00 0.125 Defence
aaron-hunt 04.09.1986 183.0 73.0 Attacking Midfielder 20136.jpg 0.00 0.25 0.125 Forward
aaron-lennon 16.04.1987 165.0 63.0 Right Midfielder 13515.jpg 0.25 0.25 0.250 MidField
aaron-ramsey 26.12.1990 178.0 76.0 Center Midfielder 94953.jpg 0.00 0.00 0.000 MidField
abdelhamid-el-kaoutari 17.03.1990 180.0 73.0 Center Back 124913.jpg 0.25 0.25 0.250 Defence
In [38]:
MIDSIZE = (12,8)
fig,ax = plt.subplots(figsize=MIDSIZE)
players.position_agg.value_counts(dropna=False, ascending=True).plot(kind='barh',ax=ax)
ax.set_ylabel("Position")
ax.set_xlabel("Counts")
# fig.tight_layout()
Out[38]:
Text(0.5, 0, 'Counts')
In [39]:
#Examine pair-wise relationships
fig, ax = plt.subplots(figsize=(10,10))
scatter_matrix(players[['height','weight','skinTone']], alpha=0.2, diagonal='hist', ax=ax)
Out[39]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001CE02FB8848>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001CE055196C8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001CE056DFF88>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001CE0571C988>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001CE05754A48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001CE05787D08>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001CE057BFE08>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001CE057F6F48>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001CE05801B48>]],
      dtype=object)
In [40]:
fig, ax = plt.subplots(figsize=MIDSIZE)
sns.regplot('weight', 'height', data=players, ax=ax)
ax.set_ylabel("Height [cm]")
ax.set_xlabel("Weight [kg]")
fig.tight_layout()

Insight: No outliers. But if we create different classes of heights and weights, can we see any pattern?

In [41]:
weight_classes = [
    'vlow_wt',
    'low_wt',
    'avg_wt',
    'high_wt',
    'vhigh_wt'
]
players['weightClass'] = pd.qcut(players['weight'], len(weight_classes), weight_classes)
players.head()
Out[41]:
birthday height weight position photoID rater1 rater2 skinTone position_agg weightClass
playerShort
aaron-hughes 08.11.1979 182.0 71.0 Center Back 3868.jpg 0.25 0.00 0.125 Defence low_wt
aaron-hunt 04.09.1986 183.0 73.0 Attacking Midfielder 20136.jpg 0.00 0.25 0.125 Forward low_wt
aaron-lennon 16.04.1987 165.0 63.0 Right Midfielder 13515.jpg 0.25 0.25 0.250 MidField vlow_wt
aaron-ramsey 26.12.1990 178.0 76.0 Center Midfielder 94953.jpg 0.00 0.00 0.000 MidField avg_wt
abdelhamid-el-kaoutari 17.03.1990 180.0 73.0 Center Back 124913.jpg 0.25 0.25 0.250 Defence low_wt
In [42]:
height_classes = [
    'vlow_ht',
    'low_ht',
    'avg_ht',
    'high_ht',
    'vhigh_ht'
]
players['heightClass'] = pd.qcut(players['height'], len(height_classes), height_classes)
players.head()    
Out[42]:
birthday height weight position photoID rater1 rater2 skinTone position_agg weightClass heightClass
playerShort
aaron-hughes 08.11.1979 182.0 71.0 Center Back 3868.jpg 0.25 0.00 0.125 Defence low_wt avg_ht
aaron-hunt 04.09.1986 183.0 73.0 Attacking Midfielder 20136.jpg 0.00 0.25 0.125 Forward low_wt avg_ht
aaron-lennon 16.04.1987 165.0 63.0 Right Midfielder 13515.jpg 0.25 0.25 0.250 MidField vlow_wt vlow_ht
aaron-ramsey 26.12.1990 178.0 76.0 Center Midfielder 94953.jpg 0.00 0.00 0.000 MidField avg_wt low_ht
abdelhamid-el-kaoutari 17.03.1990 180.0 73.0 Center Back 124913.jpg 0.25 0.25 0.250 Defence low_wt low_ht

Pandas Profiling

In [43]:
pandas_profiling.ProfileReport(players)








Out[43]:

In [44]:
# modifying dataframe
players['birth_date'] = pd.to_datetime(players.birthday, format='%d.%m.%Y')
players['age_years'] = ((pd.to_datetime("2013-01-01") - players['birth_date']).dt.days)/365.25
In [45]:
players_cleaned_variables = [#'birthday',
                             'height',
                             'weight',
#                              'position',
#                              'photoID',
#                              'rater1',
#                              'rater2',
                             
                             'position_agg',
                             'weightClass',
                             'heightClass',
                             'skinTone',
#                              'birth_date',
                             'age_years']
In [46]:
pandas_profiling.ProfileReport(players[players_cleaned_variables])








Out[46]:

In [47]:
clean_players = players[players_cleaned_variables]
clean_players.head()
Out[47]:
height weight position_agg weightClass heightClass skinTone age_years
playerShort
aaron-hughes 182.0 71.0 Defence low_wt avg_ht 0.125 33.149897
aaron-hunt 183.0 73.0 Forward low_wt avg_ht 0.125 26.327173
aaron-lennon 165.0 63.0 MidField vlow_wt vlow_ht 0.250 25.713895
aaron-ramsey 178.0 76.0 MidField avg_wt low_ht 0.000 22.017796
abdelhamid-el-kaoutari 180.0 73.0 Defence low_wt low_ht 0.250 22.795346
In [48]:
dyands.head()
Out[48]:
games victories ties defeats goals yellowCards yellowReds redCards
refNum playerShort
1 lucas-wilchez 1 0 0 1 0 0 0 0
2 john-utaka 1 0 0 1 0 1 0 0
3 abdon-prats 1 0 1 0 0 1 0 0
pablo-mari 1 1 0 0 0 0 0 0
ruben-pena 1 1 0 0 0 0 0 0
In [49]:
dyands['totalRedCards'] = dyands['yellowReds'] + dyands['redCards']
dyands.rename(columns={'redCards':'strictRedCards'}, inplace=True)
dyands.head()
Out[49]:
games victories ties defeats goals yellowCards yellowReds strictRedCards totalRedCards
refNum playerShort
1 lucas-wilchez 1 0 0 1 0 0 0 0 0
2 john-utaka 1 0 0 1 0 1 0 0 0
3 abdon-prats 1 0 1 0 0 1 0 0 0
pablo-mari 1 1 0 0 0 0 0 0 0
ruben-pena 1 1 0 0 0 0 0 0 0
In [50]:
#Remove records for the players who don't have skintone
dyands.reset_index().head()
Out[50]:
refNum playerShort games victories ties defeats goals yellowCards yellowReds strictRedCards totalRedCards
0 1 lucas-wilchez 1 0 0 1 0 0 0 0 0
1 2 john-utaka 1 0 0 1 0 1 0 0 0
2 3 abdon-prats 1 0 1 0 0 1 0 0 0
3 3 pablo-mari 1 1 0 0 0 0 0 0 0
4 3 ruben-pena 1 1 0 0 0 0 0 0 0
In [51]:
dyands.reset_index().set_index('playerShort').head()
Out[51]:
refNum games victories ties defeats goals yellowCards yellowReds strictRedCards totalRedCards
playerShort
lucas-wilchez 1 1 0 0 1 0 0 0 0 0
john-utaka 2 1 0 0 1 0 1 0 0 0
abdon-prats 3 1 0 1 0 0 1 0 0 0
pablo-mari 3 1 1 0 0 0 0 0 0 0
ruben-pena 3 1 1 0 0 0 0 0 0 0
In [54]:
#Merge player & referee
player_dyand = clean_players.merge(
    dyands.reset_index().set_index('playerShort'),
    left_index=True,
    right_index=True
)
player_dyand.head()
Out[54]:
height weight position_agg weightClass heightClass skinTone age_years refNum games victories ties defeats goals yellowCards yellowReds strictRedCards totalRedCards
playerShort
aaron-hughes 182.0 71.0 Defence low_wt avg_ht 0.125 33.149897 4 1 0 0 1 0 0 0 0 0
aaron-hughes 182.0 71.0 Defence low_wt avg_ht 0.125 33.149897 66 1 1 0 0 0 0 0 0 0
aaron-hughes 182.0 71.0 Defence low_wt avg_ht 0.125 33.149897 77 26 13 8 5 0 0 0 0 0
aaron-hughes 182.0 71.0 Defence low_wt avg_ht 0.125 33.149897 163 2 1 1 0 0 0 0 0 0
aaron-hughes 182.0 71.0 Defence low_wt avg_ht 0.125 33.149897 194 16 3 5 8 0 2 0 0 0
In [55]:
clean_dyands = (dyands.reset_index()[dyands.reset_index()
                                   .playerShort
                                   .isin(set(clean_players.index))
                                  ]).set_index(['refNum', 'playerShort'])
clean_dyands.head()
Out[55]:
games victories ties defeats goals yellowCards yellowReds strictRedCards totalRedCards
refNum playerShort
1 lucas-wilchez 1 0 0 1 0 0 0 0 0
2 john-utaka 1 0 0 1 0 1 0 0 0
4 aaron-hughes 1 0 0 1 0 0 0 0 0
aleksandar-kolarov 1 1 0 0 0 0 0 0 0
alexander-tettey 1 0 0 1 0 0 0 0 0
In [56]:
#Disagreedate the player-referee combo

colnames = ['games', 'totalRedCards']
j = 0
out = [0 for _ in range(sum(clean_dyands['games']))]

for index, row in clean_dyands.reset_index().iterrows():
    n = row['games']
    d = row['totalRedCards']
    ref = row['refNum']
    player = row['playerShort']
    for _ in range(n):
        row['totalRedCards'] = 1 if (d-_) > 0 else 0
        rowlist=list([ref, player, row['totalRedCards']])
        out[j] = rowlist
        j += 1

tidy_dyands = pd.DataFrame(out, columns=['refNum', 'playerShort', 'redcard'],).set_index(['refNum', 'playerShort'])
tidy_dyands.head()
Out[56]:
redcard
refNum playerShort
1 lucas-wilchez 0
2 john-utaka 0
4 aaron-hughes 0
aleksandar-kolarov 0
alexander-tettey 0
In [58]:
tidy_dyands.redcard.sum()
Out[58]:
3092
In [60]:
clean_dyands.games.sum()
Out[60]:
373067
In [61]:
!conda install pivottablejs -y
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/win-64::alabaster==0.7.12=py37_0
  - defaults/win-64::anaconda==2019.10=py37_0
  - defaults/win-64::anaconda-client==1.7.2=py37_0
  - defaults/win-64::anaconda-navigator==1.9.7=py37_0
  - defaults/noarch::anaconda-project==0.8.3=py_0
  - defaults/win-64::asn1crypto==1.0.1=py37_0
  - defaults/win-64::astroid==2.3.1=py37_0
  - defaults/win-64::astropy==3.2.1=py37he774522_0
  - defaults/win-64::atomicwrites==1.3.0=py37_1
  - defaults/noarch::attrs==19.2.0=py_0
  - defaults/noarch::babel==2.7.0=py_0
  - defaults/win-64::backcall==0.1.0=py37_0
  - defaults/noarch::backports==1.0=py_2
  - defaults/noarch::backports.functools_lru_cache==1.5=py_2
  - defaults/win-64::backports.os==0.1.1=py37_0
  - defaults/win-64::backports.shutil_get_terminal_size==1.0.0=py37_2
  - defaults/noarch::backports.tempfile==1.0=py_1
  - defaults/noarch::backports.weakref==1.0.post1=py_1
  - defaults/win-64::beautifulsoup4==4.8.0=py37_0
  - defaults/win-64::bitarray==1.0.1=py37he774522_0
  - defaults/win-64::bkcharts==0.2=py37_0
  - defaults/win-64::bleach==3.1.0=py37_0
  - defaults/win-64::bokeh==1.3.4=py37_0
  - defaults/win-64::boto==2.49.0=py37_0
  - defaults/win-64::bottleneck==1.2.1=py37h452e1ab_1
  - defaults/win-64::certifi==2019.9.11=py37_0
  - defaults/win-64::cffi==1.12.3=py37h7a1dbc1_0
  - defaults/win-64::chardet==3.0.4=py37_1003
  - defaults/win-64::click==7.0=py37_0
  - defaults/noarch::cloudpickle==1.2.2=py_0
  - defaults/win-64::clyent==1.2.2=py37_1
  - defaults/win-64::colorama==0.4.1=py37_0
  - defaults/win-64::comtypes==1.1.7=py37_0
  - defaults/win-64::conda==4.7.12=py37_0
  - defaults/win-64::conda-build==3.18.9=py37_3
  - defaults/win-64::conda-package-handling==1.6.0=py37h62dcd97_0
  - defaults/noarch::conda-verify==3.4.2=py_1
  - defaults/win-64::console_shortcut==0.1.1=3
  - defaults/noarch::contextlib2==0.6.0=py_0
  - defaults/win-64::cryptography==2.7=py37h7a1dbc1_0
  - defaults/win-64::cycler==0.10.0=py37_0
  - defaults/win-64::cython==0.29.13=py37ha925a31_0
  - defaults/win-64::cytoolz==0.10.0=py37he774522_0
  - defaults/noarch::dask==2.5.2=py_0
  - defaults/noarch::dask-core==2.5.2=py_0
  - defaults/win-64::decorator==4.4.0=py37_1
  - defaults/noarch::defusedxml==0.6.0=py_0
  - defaults/noarch::distributed==2.5.2=py_0
  - defaults/win-64::docutils==0.15.2=py37_0
  - defaults/win-64::entrypoints==0.3=py37_0
  - defaults/win-64::et_xmlfile==1.0.1=py37_0
  - defaults/win-64::fastcache==1.1.0=py37he774522_0
  - defaults/noarch::filelock==3.0.12=py_0
  - defaults/noarch::flask==1.1.1=py_0
  - defaults/noarch::fsspec==0.5.2=py_0
  - defaults/win-64::future==0.17.1=py37_0
  - defaults/win-64::get_terminal_size==1.0.0=h38e98db_0
  - defaults/win-64::gevent==1.4.0=py37he774522_0
  - defaults/noarch::glob2==0.7=py_0
  - defaults/win-64::greenlet==0.4.15=py37hfa6e2cd_0
  - defaults/win-64::h5py==2.9.0=py37h5e291fa_0
  - defaults/noarch::heapdict==1.0.1=py_0
  - defaults/win-64::html5lib==1.0.1=py37_0
  - defaults/win-64::idna==2.8=py37_0
  - defaults/win-64::imageio==2.6.0=py37_0
  - defaults/win-64::imagesize==1.1.0=py37_0
  - defaults/win-64::importlib_metadata==0.23=py37_0
  - defaults/win-64::ipykernel==5.1.2=py37h39e3cac_0
  - defaults/win-64::ipython==7.8.0=py37h39e3cac_0
  - defaults/win-64::ipython_genutils==0.2.0=py37_0
  - defaults/noarch::ipywidgets==7.5.1=py_0
  - defaults/win-64::isort==4.3.21=py37_0
  - defaults/win-64::itsdangerous==1.1.0=py37_0
  - defaults/noarch::jdcal==1.4.1=py_0
  - defaults/win-64::jedi==0.15.1=py37_0
  - defaults/noarch::jinja2==2.10.3=py_0
  - defaults/win-64::joblib==0.13.2=py37_0
  - defaults/noarch::json5==0.8.5=py_0
  - defaults/win-64::jsonschema==3.0.2=py37_0
  - defaults/win-64::jupyter==1.0.0=py37_7
  - defaults/noarch::jupyterlab==1.1.4=pyhf63ae98_0
  - defaults/noarch::jupyterlab_server==1.0.6=py_0
  - defaults/win-64::jupyter_client==5.3.3=py37_1
  - defaults/win-64::jupyter_console==6.0.0=py37_0
  - defaults/noarch::jupyter_core==4.5.0=py_0
  - defaults/win-64::keyring==18.0.0=py37_0
  - defaults/win-64::kiwisolver==1.1.0=py37ha925a31_0
  - defaults/win-64::lazy-object-proxy==1.4.2=py37he774522_0
  - defaults/win-64::llvmlite==0.29.0=py37ha925a31_0
  - defaults/win-64::locket==0.2.0=py37_1
  - defaults/win-64::lxml==4.4.1=py37h1350720_0
  - defaults/win-64::markupsafe==1.1.1=py37he774522_0
  - defaults/win-64::matplotlib==3.1.1=py37hc8f65d3_0
  - defaults/win-64::mccabe==0.6.1=py37_1
  - defaults/win-64::menuinst==1.4.16=py37he774522_0
  - defaults/win-64::mistune==0.8.4=py37he774522_0
  - defaults/win-64::mkl-service==2.3.0=py37hb782905_0
  - defaults/win-64::mkl_fft==1.0.14=py37h14836fe_0
  - defaults/win-64::mkl_random==1.1.0=py37h675688f_0
  - defaults/win-64::mock==3.0.5=py37_0
  - defaults/win-64::more-itertools==7.2.0=py37_0
  - defaults/win-64::mpmath==1.1.0=py37_0
  - defaults/win-64::msgpack-python==0.6.1=py37h74a9793_1
  - defaults/win-64::multipledispatch==0.6.0=py37_0
  - defaults/win-64::navigator-updater==0.2.1=py37_0
  - defaults/win-64::nbconvert==5.6.0=py37_1
  - defaults/win-64::nbformat==4.4.0=py37_0
  - defaults/noarch::networkx==2.3=py_0
  - defaults/win-64::nltk==3.4.5=py37_0
  - defaults/win-64::nose==1.3.7=py37_2
  - defaults/win-64::notebook==6.0.1=py37_0
  - defaults/win-64::numba==0.45.1=py37hf9181ef_0
  - defaults/win-64::numexpr==2.7.0=py37hdce8814_0
  - defaults/win-64::numpy==1.16.5=py37h19fb1c0_0
  - defaults/win-64::numpy-base==1.16.5=py37hc3f5095_0
  - defaults/noarch::numpydoc==0.9.1=py_0
  - defaults/win-64::olefile==0.46=py37_0
  - defaults/noarch::openpyxl==3.0.0=py_0
  - defaults/noarch::packaging==19.2=py_0
  - defaults/win-64::pandas==0.25.1=py37ha925a31_0
  - defaults/win-64::pandocfilters==1.4.2=py37_1
  - defaults/noarch::parso==0.5.1=py_0
  - defaults/noarch::partd==1.0.0=py_0
  - defaults/noarch::path.py==12.0.1=py_0
  - defaults/win-64::pathlib2==2.3.5=py37_0
  - defaults/win-64::patsy==0.5.1=py37_0
  - defaults/win-64::pep8==1.7.1=py37_0
  - defaults/win-64::pickleshare==0.7.5=py37_0
  - defaults/win-64::pillow==6.2.0=py37hdc69c19_0
  - defaults/win-64::pkginfo==1.5.0.1=py37_0
  - defaults/win-64::pluggy==0.13.0=py37_0
  - defaults/win-64::ply==3.11=py37_0
  - defaults/win-64::powershell_shortcut==0.0.1=2
  - defaults/noarch::prometheus_client==0.7.1=py_0
  - defaults/noarch::prompt_toolkit==2.0.10=py_0
  - defaults/win-64::psutil==5.6.3=py37he774522_0
  - defaults/win-64::py==1.8.0=py37_0
  - defaults/win-64::py-lief==0.9.0=py37ha925a31_2
  - defaults/win-64::pycodestyle==2.5.0=py37_0
  - defaults/win-64::pycosat==0.6.3=py37hfa6e2cd_0
  - defaults/win-64::pycparser==2.19=py37_0
  - defaults/win-64::pycrypto==2.6.1=py37hfa6e2cd_9
  - defaults/win-64::pycurl==7.43.0.3=py37h7a1dbc1_0
  - defaults/win-64::pyflakes==2.1.1=py37_0
  - defaults/noarch::pygments==2.4.2=py_0
  - defaults/win-64::pylint==2.4.2=py37_0
  - defaults/win-64::pyodbc==4.0.27=py37ha925a31_0
  - defaults/win-64::pyopenssl==19.0.0=py37_0
  - defaults/noarch::pyparsing==2.4.2=py_0
  - defaults/win-64::pyqt==5.9.2=py37h6538335_2
  - defaults/win-64::pyreadline==2.1=py37_1
  - defaults/win-64::pyrsistent==0.15.4=py37he774522_0
  - defaults/win-64::pysocks==1.7.1=py37_0
  - defaults/win-64::pytables==3.5.2=py37h1da0976_1
  - defaults/win-64::pytest==5.2.1=py37_0
  - defaults/win-64::pytest-arraydiff==0.3=py37h39e3cac_0
  - defaults/win-64::pytest-astropy==0.5.0=py37_0
  - defaults/noarch::pytest-doctestplus==0.4.0=py_0
  - defaults/noarch::pytest-openfiles==0.4.0=py_0
  - defaults/win-64::pytest-remotedata==0.3.2=py37_0
  - defaults/win-64::python==3.7.4=h5263a28_0
  - defaults/win-64::python-dateutil==2.8.0=py37_0
  - defaults/win-64::python-libarchive-c==2.8=py37_13
  - defaults/noarch::pytz==2019.3=py_0
  - defaults/win-64::pywavelets==1.0.3=py37h8c2d366_1
  - defaults/win-64::pywin32==223=py37hfa6e2cd_1
  - defaults/win-64::pywinpty==0.5.5=py37_1000
  - defaults/win-64::pyyaml==5.1.2=py37he774522_0
  - defaults/win-64::pyzmq==18.1.0=py37ha925a31_0
  - defaults/noarch::qtawesome==0.6.0=py_0
  - defaults/noarch::qtconsole==4.5.5=py_0
  - defaults/noarch::qtpy==1.9.0=py_0
  - defaults/win-64::requests==2.22.0=py37_0
  - defaults/noarch::rope==0.14.0=py_0
  - defaults/win-64::ruamel_yaml==0.15.46=py37hfa6e2cd_0
  - defaults/win-64::scikit-image==0.15.0=py37ha925a31_0
  - defaults/win-64::scikit-learn==0.21.3=py37h6288b17_0
  - defaults/win-64::scipy==1.3.1=py37h29ff71c_0
  - defaults/win-64::seaborn==0.9.0=py37_0
  - defaults/win-64::send2trash==1.5.0=py37_0
  - defaults/win-64::setuptools==41.4.0=py37_0
  - defaults/win-64::simplegeneric==0.8.1=py37_2
  - defaults/win-64::singledispatch==3.4.0.3=py37_0
  - defaults/win-64::sip==4.19.8=py37h6538335_0
  - defaults/win-64::six==1.12.0=py37_0
  - defaults/noarch::snowballstemmer==2.0.0=py_0
  - defaults/win-64::sortedcollections==1.1.2=py37_0
  - defaults/win-64::sortedcontainers==2.1.0=py37_0
  - defaults/win-64::soupsieve==1.9.3=py37_0
  - defaults/noarch::sphinx==2.2.0=py_0
  - defaults/win-64::sphinxcontrib==1.0=py37_1
  - defaults/noarch::sphinxcontrib-applehelp==1.0.1=py_0
  - defaults/noarch::sphinxcontrib-devhelp==1.0.1=py_0
  - defaults/noarch::sphinxcontrib-htmlhelp==1.0.2=py_0
  - defaults/noarch::sphinxcontrib-jsmath==1.0.1=py_0
  - defaults/noarch::sphinxcontrib-qthelp==1.0.2=py_0
  - defaults/noarch::sphinxcontrib-serializinghtml==1.1.3=py_0
  - defaults/noarch::sphinxcontrib-websupport==1.1.2=py_0
  - defaults/win-64::spyder==3.3.6=py37_0
  - defaults/win-64::spyder-kernels==0.5.2=py37_0
  - defaults/win-64::sqlalchemy==1.3.9=py37he774522_0
  - defaults/win-64::statsmodels==0.10.1=py37h8c2d366_0
  - defaults/win-64::sympy==1.4=py37_0
  - defaults/noarch::tblib==1.4.0=py_0
  - defaults/win-64::terminado==0.8.2=py37_0
  - defaults/win-64::testpath==0.4.2=py37_0
  - defaults/noarch::toolz==0.10.0=py_0
  - defaults/win-64::tornado==6.0.3=py37he774522_0
  - defaults/noarch::tqdm==4.36.1=py_0
  - defaults/win-64::traitlets==4.3.3=py37_0
  - defaults/win-64::unicodecsv==0.14.1=py37_0
  - defaults/win-64::urllib3==1.24.2=py37_0
  - defaults/win-64::wcwidth==0.1.7=py37_0
  - defaults/win-64::webencodings==0.5.1=py37_1
  - defaults/noarch::werkzeug==0.16.0=py_0
  - defaults/win-64::wheel==0.33.6=py37_0
  - defaults/win-64::widgetsnbextension==3.5.1=py37_0
  - defaults/win-64::wincertstore==0.2=py37_0
  - defaults/win-64::win_inet_pton==1.1.0=py37_0
  - defaults/win-64::win_unicode_console==0.5=py37_0
  - defaults/win-64::wrapt==1.11.2=py37he774522_0
  - defaults/win-64::xlrd==1.2.0=py37_0
  - defaults/noarch::xlsxwriter==1.2.1=py_0
  - defaults/win-64::xlwings==0.15.10=py37_0
  - defaults/win-64::xlwt==1.3.0=py37_0
  - defaults/noarch::zict==1.0.0=py_0
  - defaults/noarch::zipp==0.6.0=py_0
  - defaults/win-64::_ipyw_jlab_nb_ext_conf==0.1.0=py37_0
done

## Package Plan ##

  environment location: C:\Users\mebandar\AppData\Local\Continuum\anaconda3

  added / updated specs:
    - pivottablejs


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _anaconda_depends-2020.02  |           py37_0           6 KB
    anaconda-custom            |           py37_1           3 KB
    argh-0.26.2                |           py37_0          36 KB
    autopep8-1.4.4             |             py_0          41 KB
    bcrypt-3.1.7               |   py37he774522_0          38 KB
    ca-certificates-2020.1.1   |                0         125 KB
    diff-match-patch-20181111  |             py_0          39 KB
    flake8-3.7.9               |           py37_0         188 KB
    hypothesis-5.8.3           |             py_0         228 KB
    intervaltree-3.0.2         |             py_0          25 KB
    libspatialindex-1.9.3      |       h33f27b4_0         351 KB
    matplotlib-base-3.1.3      |   py37h64f37c6_0         4.9 MB
    openssl-1.1.1g             |       he774522_0         4.8 MB
    paramiko-2.7.1             |             py_0         142 KB
    path-13.2.0                |           py37_0          37 KB
    pathtools-0.1.2            |             py_1          10 KB
    pexpect-4.8.0              |           py37_0          82 KB
    pip-20.0.2                 |           py37_1         1.7 MB
    pivottablejs-0.9.0         |           py37_0           7 KB
    pydocstyle-4.0.1           |             py_0          34 KB
    pynacl-1.3.0               |   py37h62dcd97_0         1.1 MB
    pytest-astropy-header-0.1.2|             py_0          12 KB
    python-jsonrpc-server-0.3.4|             py_0          13 KB
    python-language-server-0.31.10|           py37_0          94 KB
    pywin32-ctypes-0.2.0       |        py37_1000          41 KB
    qdarkstyle-2.8.1           |             py_0         176 KB
    rtree-0.9.3                |   py37h21ff451_0          47 KB
    ujson-1.35                 |   py37hfa6e2cd_0          24 KB
    watchdog-0.10.2            |           py37_0         112 KB
    yapf-0.28.0                |             py_0         120 KB
    ------------------------------------------------------------
                                           Total:        14.5 MB

The following NEW packages will be INSTALLED:

  _anaconda_depends  pkgs/main/win-64::_anaconda_depends-2020.02-py37_0
  argh               pkgs/main/win-64::argh-0.26.2-py37_0
  autopep8           pkgs/main/noarch::autopep8-1.4.4-py_0
  bcrypt             pkgs/main/win-64::bcrypt-3.1.7-py37he774522_0
  diff-match-patch   pkgs/main/noarch::diff-match-patch-20181111-py_0
  flake8             pkgs/main/win-64::flake8-3.7.9-py37_0
  hypothesis         pkgs/main/noarch::hypothesis-5.8.3-py_0
  intervaltree       pkgs/main/noarch::intervaltree-3.0.2-py_0
  libspatialindex    pkgs/main/win-64::libspatialindex-1.9.3-h33f27b4_0
  matplotlib-base    pkgs/main/win-64::matplotlib-base-3.1.3-py37h64f37c6_0
  paramiko           pkgs/main/noarch::paramiko-2.7.1-py_0
  path               pkgs/main/win-64::path-13.2.0-py37_0
  pathtools          pkgs/main/noarch::pathtools-0.1.2-py_1
  pexpect            pkgs/main/win-64::pexpect-4.8.0-py37_0
  pip                pkgs/main/win-64::pip-20.0.2-py37_1
  pivottablejs       pkgs/main/win-64::pivottablejs-0.9.0-py37_0
  pydocstyle         pkgs/main/noarch::pydocstyle-4.0.1-py_0
  pynacl             pkgs/main/win-64::pynacl-1.3.0-py37h62dcd97_0
  pytest-astropy-he~ pkgs/main/noarch::pytest-astropy-header-0.1.2-py_0
  python-jsonrpc-se~ pkgs/main/noarch::python-jsonrpc-server-0.3.4-py_0
  python-language-s~ pkgs/main/win-64::python-language-server-0.31.10-py37_0
  pywin32-ctypes     pkgs/main/win-64::pywin32-ctypes-0.2.0-py37_1000
  qdarkstyle         pkgs/main/noarch::qdarkstyle-2.8.1-py_0
  rtree              pkgs/main/win-64::rtree-0.9.3-py37h21ff451_0
  ujson              pkgs/main/win-64::ujson-1.35-py37hfa6e2cd_0
  watchdog           pkgs/main/win-64::watchdog-0.10.2-py37_0
  yapf               pkgs/main/noarch::yapf-0.28.0-py_0

The following packages will be UPDATED:

  ca-certificates                               2019.8.28-0 --> 2020.1.1-0
  openssl                                 1.1.1d-he774522_2 --> 1.1.1g-he774522_0

The following packages will be DOWNGRADED:

  anaconda                                   2019.10-py37_0 --> custom-py37_1



Downloading and Extracting Packages

libspatialindex-1.9. | 351 KB    |            |   0% 
libspatialindex-1.9. | 351 KB    | 4          |   5% 
libspatialindex-1.9. | 351 KB    | ###6       |  36% 
libspatialindex-1.9. | 351 KB    | #######2   |  73% 
libspatialindex-1.9. | 351 KB    | ########## | 100% 

pathtools-0.1.2      | 10 KB     |            |   0% 
pathtools-0.1.2      | 10 KB     | ########## | 100% 

ujson-1.35           | 24 KB     |            |   0% 
ujson-1.35           | 24 KB     | ######6    |  67% 
ujson-1.35           | 24 KB     | ########## | 100% 

pip-20.0.2           | 1.7 MB    |            |   0% 
pip-20.0.2           | 1.7 MB    |            |   1% 
pip-20.0.2           | 1.7 MB    | #3         |  13% 
pip-20.0.2           | 1.7 MB    | #6         |  17% 
pip-20.0.2           | 1.7 MB    | ####3      |  43% 
pip-20.0.2           | 1.7 MB    | #####5     |  55% 
pip-20.0.2           | 1.7 MB    | ######8    |  68% 
pip-20.0.2           | 1.7 MB    | ########1  |  82% 
pip-20.0.2           | 1.7 MB    | #########5 |  96% 
pip-20.0.2           | 1.7 MB    | ########## | 100% 

rtree-0.9.3          | 47 KB     |            |   0% 
rtree-0.9.3          | 47 KB     | ###4       |  34% 
rtree-0.9.3          | 47 KB     | ########## | 100% 

python-jsonrpc-serve | 13 KB     |            |   0% 
python-jsonrpc-serve | 13 KB     | ########## | 100% 

bcrypt-3.1.7         | 38 KB     |            |   0% 
bcrypt-3.1.7         | 38 KB     | ########## | 100% 

ca-certificates-2020 | 125 KB    |            |   0% 
ca-certificates-2020 | 125 KB    | ###8       |  38% 
ca-certificates-2020 | 125 KB    | ########## | 100% 

matplotlib-base-3.1. | 4.9 MB    |            |   0% 
matplotlib-base-3.1. | 4.9 MB    |            |   1% 
matplotlib-base-3.1. | 4.9 MB    | #8         |  18% 
matplotlib-base-3.1. | 4.9 MB    | ##3        |  23% 
matplotlib-base-3.1. | 4.9 MB    | ##8        |  28% 
matplotlib-base-3.1. | 4.9 MB    | ###3       |  33% 
matplotlib-base-3.1. | 4.9 MB    | ###7       |  38% 
matplotlib-base-3.1. | 4.9 MB    | ####1      |  41% 
matplotlib-base-3.1. | 4.9 MB    | ####5      |  46% 
matplotlib-base-3.1. | 4.9 MB    | ####9      |  50% 
matplotlib-base-3.1. | 4.9 MB    | #####3     |  54% 
matplotlib-base-3.1. | 4.9 MB    | #####8     |  58% 
matplotlib-base-3.1. | 4.9 MB    | ######2    |  62% 
matplotlib-base-3.1. | 4.9 MB    | ######6    |  66% 
matplotlib-base-3.1. | 4.9 MB    | #######    |  71% 
matplotlib-base-3.1. | 4.9 MB    | #######6   |  76% 
matplotlib-base-3.1. | 4.9 MB    | #######9   |  80% 
matplotlib-base-3.1. | 4.9 MB    | ########7  |  88% 
matplotlib-base-3.1. | 4.9 MB    | #########2 |  92% 
matplotlib-base-3.1. | 4.9 MB    | #########6 |  97% 
matplotlib-base-3.1. | 4.9 MB    | ########## | 100% 

paramiko-2.7.1       | 142 KB    |            |   0% 
paramiko-2.7.1       | 142 KB    | ####5      |  45% 
paramiko-2.7.1       | 142 KB    | ########## | 100% 

flake8-3.7.9         | 188 KB    |            |   0% 
flake8-3.7.9         | 188 KB    | ###3       |  34% 
flake8-3.7.9         | 188 KB    | ########## | 100% 

diff-match-patch-201 | 39 KB     |            |   0% 
diff-match-patch-201 | 39 KB     | ########1  |  81% 
diff-match-patch-201 | 39 KB     | ########## | 100% 

anaconda-custom      | 3 KB      |            |   0% 
anaconda-custom      | 3 KB      | ########## | 100% 

pytest-astropy-heade | 12 KB     |            |   0% 
pytest-astropy-heade | 12 KB     | ########## | 100% 

yapf-0.28.0          | 120 KB    |            |   0% 
yapf-0.28.0          | 120 KB    | ####       |  40% 
yapf-0.28.0          | 120 KB    | ########## | 100% 

path-13.2.0          | 37 KB     |            |   0% 
path-13.2.0          | 37 KB     | ########## | 100% 

qdarkstyle-2.8.1     | 176 KB    |            |   0% 
qdarkstyle-2.8.1     | 176 KB    | ##7        |  27% 
qdarkstyle-2.8.1     | 176 KB    | ########## | 100% 

pydocstyle-4.0.1     | 34 KB     |            |   0% 
pydocstyle-4.0.1     | 34 KB     | ########## | 100% 

hypothesis-5.8.3     | 228 KB    |            |   0% 
hypothesis-5.8.3     | 228 KB    | ##1        |  21% 
hypothesis-5.8.3     | 228 KB    | ########## | 100% 

openssl-1.1.1g       | 4.8 MB    |            |   0% 
openssl-1.1.1g       | 4.8 MB    |            |   1% 
openssl-1.1.1g       | 4.8 MB    | 5          |   5% 
openssl-1.1.1g       | 4.8 MB    | 9          |  10% 
openssl-1.1.1g       | 4.8 MB    | #3         |  14% 
openssl-1.1.1g       | 4.8 MB    | #7         |  18% 
openssl-1.1.1g       | 4.8 MB    | ##7        |  27% 
openssl-1.1.1g       | 4.8 MB    | ###1       |  32% 
openssl-1.1.1g       | 4.8 MB    | ###6       |  36% 
openssl-1.1.1g       | 4.8 MB    | ####       |  41% 
openssl-1.1.1g       | 4.8 MB    | ####4      |  45% 
openssl-1.1.1g       | 4.8 MB    | ####9      |  49% 
openssl-1.1.1g       | 4.8 MB    | #####6     |  57% 
openssl-1.1.1g       | 4.8 MB    | ######1    |  61% 
openssl-1.1.1g       | 4.8 MB    | ######5    |  66% 
openssl-1.1.1g       | 4.8 MB    | #######1   |  71% 
openssl-1.1.1g       | 4.8 MB    | ########   |  81% 
openssl-1.1.1g       | 4.8 MB    | ########6  |  86% 
openssl-1.1.1g       | 4.8 MB    | #########1 |  91% 
openssl-1.1.1g       | 4.8 MB    | #########6 |  96% 
openssl-1.1.1g       | 4.8 MB    | ########## | 100% 

intervaltree-3.0.2   | 25 KB     |            |   0% 
intervaltree-3.0.2   | 25 KB     | ########## | 100% 

pywin32-ctypes-0.2.0 | 41 KB     |            |   0% 
pywin32-ctypes-0.2.0 | 41 KB     | ########## | 100% 

python-language-serv | 94 KB     |            |   0% 
python-language-serv | 94 KB     | #####1     |  51% 
python-language-serv | 94 KB     | ########## | 100% 

pexpect-4.8.0        | 82 KB     |            |   0% 
pexpect-4.8.0        | 82 KB     | #######8   |  78% 
pexpect-4.8.0        | 82 KB     | ########## | 100% 

_anaconda_depends-20 | 6 KB      |            |   0% 
_anaconda_depends-20 | 6 KB      | ########## | 100% 

watchdog-0.10.2      | 112 KB    |            |   0% 
watchdog-0.10.2      | 112 KB    | ####2      |  43% 
watchdog-0.10.2      | 112 KB    | ########## | 100% 

pynacl-1.3.0         | 1.1 MB    |            |   0% 
pynacl-1.3.0         | 1.1 MB    | 5          |   5% 
pynacl-1.3.0         | 1.1 MB    | ##3        |  23% 
pynacl-1.3.0         | 1.1 MB    | #####4     |  55% 
pynacl-1.3.0         | 1.1 MB    | #########2 |  93% 
pynacl-1.3.0         | 1.1 MB    | ########## | 100% 

pivottablejs-0.9.0   | 7 KB      |            |   0% 
pivottablejs-0.9.0   | 7 KB      | ########## | 100% 

autopep8-1.4.4       | 41 KB     |            |   0% 
autopep8-1.4.4       | 41 KB     | ###8       |  39% 
autopep8-1.4.4       | 41 KB     | ########## | 100% 

argh-0.26.2          | 36 KB     |            |   0% 
argh-0.26.2          | 36 KB     | ####4      |  44% 
argh-0.26.2          | 36 KB     | ########## | 100% 
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done

==> WARNING: A newer version of conda exists. <==
  current version: 4.7.12
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base -c defaults conda


In [62]:
from pivottablejs import pivot_ui
In [63]:
temp = tidy_dyands.reset_index().set_index('playerShort').merge(clean_players, left_index=True, right_index=True)
In [64]:
temp.shape
Out[64]:
(373067, 9)
In [66]:
# pivot_ui(temp[['skinTone', 'position_agg', 'redcard']], )
In [67]:
# How many games has each player played in?
games = tidy_dyands.groupby(level=1).count()
sns.distplot(games);
In [68]:
(tidy_dyands.groupby(level=0)
           .count()
           .sort_values('redcard', ascending=False)
           .rename(columns={'redcard':'total games refereed'})).head()
Out[68]:
total games refereed
refNum
1214 5598
2080 4773
1942 4469
2398 4363
3099 4087
In [70]:
(tidy_dyands.groupby(level=0)
           .sum()
           .sort_values('redcard', ascending=False)
           .rename(columns={'redcard':'total redcards given'})).head()
Out[70]:
total redcards given
refNum
468 49
432 49
449 47
916 42
2073 36
In [72]:
(tidy_dyands.groupby(level=1)
           .sum()
           .sort_values('redcard', ascending=False)
           .rename(columns={'redcard':'total redcards received'})).head()
Out[72]:
total redcards received
playerShort
cyril-jeunechamp 19
sergio-ramos 19
gary-medel 16
ballesteros 15
juanfran 15
In [74]:
tidy_dyands.groupby(level=0).size().sort_values(ascending=False)
Out[74]:
refNum
1214    5598
2080    4773
1942    4469
2398    4363
3099    4087
        ... 
825        1
824        1
2620       1
2616       1
1          1
Length: 2978, dtype: int64
In [75]:
total_ref_games = tidy_dyands.groupby(level=0).size().sort_values(ascending=False)
total_player_games = tidy_dyands.groupby(level=1).size().sort_values(ascending=False)
In [76]:
total_ref_given = tidy_dyands.groupby(level=0).sum().sort_values(ascending=False,by='redcard')
total_player_received = tidy_dyands.groupby(level=1).sum().sort_values(ascending=False, by='redcard')
In [78]:
sns.distplot(total_player_received, kde=False)
Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce1b450388>
In [79]:
sns.distplot(total_ref_given, kde=False);
In [80]:
tidy_dyands.groupby(level=1).sum().sort_values(ascending=False, by='redcard').head()
Out[80]:
redcard
playerShort
cyril-jeunechamp 19
sergio-ramos 19
gary-medel 16
ballesteros 15
juanfran 15
In [83]:
tidy_dyands.sum(), tidy_dyands.count(), tidy_dyands.sum()/tidy_dyands.count()
Out[83]:
(redcard    3092
 dtype: int64, redcard    373067
 dtype: int64, redcard    0.008288
 dtype: float64)
In [85]:
player_ref_game = (tidy_dyands.reset_index()
                               .set_index('playerShort')
                                       .merge(clean_players,
                                              left_index=True,
                                              right_index=True)
                  )
In [86]:
player_ref_game.head()
Out[86]:
refNum redcard height weight position_agg weightClass heightClass skinTone age_years
playerShort
aaron-hughes 4 0 182.0 71.0 Defence low_wt avg_ht 0.125 33.149897
aaron-hughes 66 0 182.0 71.0 Defence low_wt avg_ht 0.125 33.149897
aaron-hughes 77 0 182.0 71.0 Defence low_wt avg_ht 0.125 33.149897
aaron-hughes 77 0 182.0 71.0 Defence low_wt avg_ht 0.125 33.149897
aaron-hughes 77 0 182.0 71.0 Defence low_wt avg_ht 0.125 33.149897
In [87]:
bootstrap = pd.concat([player_ref_game.sample(replace=True, 
                                              n=10000).groupby('skinTone').mean() 
                       for _ in range(100)])
In [88]:
ax = sns.regplot(bootstrap.index.values,
                 y='redcard',
                 data=bootstrap,
                 lowess=True,
                 scatter_kws={'alpha':0.4,},
                 x_jitter=(0.125 / 4.0))
ax.set_xlabel("Skintone");

Conclusion: skintone does play the only role in red cards.

In [ ]: